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We present a simple mathematical framework and API for parallel mesh and data distribution, load balanc¬ 
ing, and overlap generation. It relies on viewing the mesh as a Hasse diagram, abstracting away information 
such as cell shape, dimension, and coordinates. The high level of abstraction makes our interface both con¬ 
cise and powerful, as the same algorithm applies to any representable mesh, such as hybrid meshes, meshes 
embedded in higher dimension, and overlapped meshes in parallel. We present evidence, both theoretical 
and experimental, that the algorithms are scalable and efficient. A working implementation can be found in 
the latest release of the PETSc libraries. 
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1. INTRODUCTION 

The algorithms and implementation for scalable mesh management, encompassing 
partitioning, distribution, rebalancing, and overlap generation, as well as data man¬ 
agement over a mesh can be quite complex. It is common to divide meshes into col¬ 
lections of entities (cell, face, edge, vertex) of different dimensions which can take a 
wide variety of forms (triangle, pentagon, tetrahedron, pyramid, ...), and have query 
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functions tailored to each specific form (D’Azevedo and et. al. 2015). This code struc¬ 
ture, however, results in many different cases, little reuse, and greatly increases the 
complexity and maintenance burden. On the other hand, codes for adapti ve redistribu¬ 
tion of meshes based on parallel partitioning such as the Zoltan library jPevine et al. 


|2006| , usually represent the mesh purely as an undirected graph, encoding cells and 
vertices and ignoring the topology. For data distribution, interfaces have been special¬ 
ized to each specific function space represented on the mesh. In Zoltan, for example, 
the user is responsible for supplying functions to pack and unpack data from commu¬ 
nication buffers. This process can be automated however, as in DUNE-FEM [|Dedner| 
et al. 2010) which attaches data to entities, much like our mesh points described be¬ 


low. 

We have previously presented a mesh representation which has a single entity 
type, called points , and a single antisymmetric relation, called cov ering |Knepley 


jand Karpeev 200^ . This structure, more precisely a Hasse diagr am [ Birkhoff' 1967 
Wikipedia 2015bT 7can represent any CW-complex jHatcher 2002[ [Wikipedia 2015a) , 
and can be represented algorithmically as a directed acyclic graph (DAG) over the 
points. It comes with two simple relational operations, con®(p), called the cone of p 
or the in-edges of point p in the DAG, and its dual operation »iupp(p), called the sup¬ 
port of p or the out-edges of point p. In addition, we will add the transitive closure 
in the DAG of these two operations, respectively the closure dl(p) and star st{p) of 
point p. In Fig. we show an example mesh and its corresponding DAG, for which 
we have con®(^ = {a,b,e} and siupp(/3) = {a, c, e}, and the transitive closures 
c]l(A) = {A, a, b, e, a, (3, 7 } and »t{/3) = {/3, a, c, e, A, B}. 


7 



/3 



Fig. 1. A simplicial doublet mesh and its DAG (Hasse diagram). 


In our prior work [Knepley and Karpeev 20^ 1, it was unclear whether simple generic 
algorithms for parallel mesh management tasks could be formulated, or various types 
of meshes would require special purpose code despite the generic mesh representation. 
Below, we present a complete set of generic algorithms, operating on our generic DAG 
representation, for parallel mesh operations, including partitioning, distribution, re- 
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balancing, and overlap generation. The theoretical underpinnings and algorithms are 
laid out in Section]^ and experimental results detailed in Section]^ 


2. THEORY 


2.1. Overlap Creation 


We will use the H asse diagram representatio n of our computational mesh IKnepley 
and Karpeev 20091, the DMPlex class in PETSc ||Balay et al. 2014at Balay et al. 2014B)^ 


and describe mesh relations (adjacencies) with basic graph operations on a DAG. A 
distributed mesh is a collection of closed serial meshes, meaning that they contain 
the closure of each point, together with an “overlap structure”, which marks a subset 
of the mesh points and indicates processes with which these points are shared. The 
default PE TSc represent ation of the overlap information uses the SF class, short for 
Star Forest [ Brown 2011) . Each process stores the true owner (root) of its own ghost 
points (leaves), one side of the relation above, and construct the other side automati¬ 
cally. 


In order to reason about potential parallel mesh algorithms, we will characterize the 
contents of the overlap using the mesh operations. These operations will be under¬ 
stood to operate on the entire parallel mesh, identifying shared points, rather than 
just the local meshes on each process. To indicate a purely local operation, we will use 
a subscript, e.g. dlioc(p) to indicate the closure of a point p evaluated only on the local 
submesh. 


The mesh overlap contains all points of the local mesh adjacent to points of remote 
meshes in the complete DAG for the parallel mesh, and we will indicate that point p is 
in the overlap using an indicator function O. Moreover, if the overlap contains a point 
p on a given process, then it will also contain the closure of p, 

0{p)^^0{q) Vgec]l(p), (1) 

which shows that if a point is shared, its closure is also shared. This is a consequence 
of each local mesh being closed, the transitive closure of its Hasse diagram. We can 
now examine the effect of increasing the mesh overlap in parallel by including all the 
immediately adjacent mesh points to each local mesh. 

The set of adjacent mesh point differs depending on the discretization. For example, 
the finite element method couples unknowns to all other unknowns whose associated 
basis functions overlap the support of the given basis function. If functions are sup¬ 
ported on cells whose closure contains the associated mesh point, we have the rela¬ 
tion 


adj(p, g) g e dl(»t(p)), (2) 

where we note that this relation is symmetric. For example, a degree of freedom (doD 
associated with a vertex is adjacent to all dofs on the cells containing that vertex. We 
will call this FE adjacency. On the other hand, for finite volume methods, we typically 
couple cell unknowns only through faces, so that we have 

adj(p, g) <;=^ g e siupp(con(E(p)), (3) 

which is the common notion of cell-adjacency in meshes, and what we will call FV 
adjacency. This will also be the adjacency pattern for Discontinuous Galerkin meth¬ 
ods. 
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If we first consider FV adjacency, we see that the cone operation can be satisfied locally 
since local meshes are closed. Thus the support from neighboring processes is needed 
for all points in the overlap. Moreover, in order to preserve the closure property of local 
meshes, the closure of that support would also need to be collected. 

For FE adjacency, each process begins by collecting the star of its overlap region in the 
local mesh, »tioc(0). The union across all processes will produce the star of each point 
in the overlap region. First, note that if the star of a point p on the local processes 
contains a point q on the remote process, then q must be contained in the star of a 
point o in the overlap, 

q G »t(p) 3o I 0{o) Aq G »t(o). (4) 

There is a path from p to g in the mesh DAG, since q lies in star of p, which is the 
transitive closure. There must be an edge in this path which connects a point on the 
local mesh to one on the remote mesh, otherwise the path is completely contained in 
the local mesh. One of the endpoints o of this edge will be contained in the overlap, 
since it contains all local points adjacent to remote points in the DAG. In fact, q lies in 
the star of o, since o lies on the path from p to q. Thus, the star of p is contained in the 
union of the star of the overlap, 

st{p) G IJst(o). (5) 

O 

Taking the closure of this star is a local operation, since local meshes are closed. There¬ 
fore, parallel overlap creation can be accomplished by the following sequence: each lo¬ 
cal mesh collects the closure of the star of its overlap, communicates this to its overlap 
neighbors, and then each neighbor augments its overlap with the new points. More¬ 
over, no extra points are communicated, since each communicated point q is adjacent 
to some p on a remote process. 


2.2. Data Distribution 


We will recognize three b asic objects describing a parallel data layout: the 
Sectio n jBalay et al. 2014a) describing an irregular array of data and the SF, Star- 
Forest [ Brown 2011) , a one-sided description of shared data. A Section is a map from 
a domain of points to data sizes, or ndofs, and assuming the data is packed it can also 
calculate an offset for each point. This is exactly the encod ing strategy used in the 
Compressed Sparse Row matrix format | |Balay et al. 2014^ . An SF stores the owner 
for any piece of shared data which is not owned by the given process, so it is a one¬ 
sided description of sharing. This admits a very sparse storage scheme, and a scalable 
algorithm for assembly of the communication topology | |Hoefler et al. 2010) . The third 
local object, a Label, is merely a one-to-many map between integers, that can be ma¬ 
nipulated in a very similar fashion to a Section since the structure is so similar, but 
has better complexity for mutation operations. 


A Section may be stored as a simple list of (ndof, offset) pairs, and the SF as {Idof, rdof, 
rank) triples where Idof is the local dof number and rdof is the remote dof number, 
which means we never need a global numbering of the unknowns. Starting with these 
two simple objects, we may mechanically build complex, parallel data distributions 
from simple algebraic combination operations. We will illustrate this process with a 
simple example. 


Suppose we begin with a parallel cell-vertex mesh having degrees of freedom on the 
vertices. On each process, a Section holds the number of dofs on each vertex, and 
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Fig. 2. This figure illustrates the relation between different Section/SF pairs. The first column gives the 
domain space for the Section, the second the range space for the Section and domain and range for the SF. 
The Section and SF columns give the semantic content for those structures at each level, and the arrows 
show how the SF at each level can be constructed with input from below. Each horizontal line describes the 
parallel layout of a certain data set. For example, the second line down describes the parallel layout of the 
solution field. 


an SF lists the vertices which are owned by other processes. Notice that the domain 
(point space) of the Section is both the domain and the range (dof space) of the SF. We 
can combine these two to create a new SF whose domain and range (dof space) match 
the range space of the Section. This uses the PetscSFCreateSectionSFO function, 
which is completely local except for the communication of remote dof offsets, which 
needs a single sparse broadcast from dof owners (roots) to dof sharers (leaves), accom¬ 
plished using PetscSFBcastO. The resulting SF describes the shared dofs rather than 
the shared vertices. We can think of the new SF as the push-forward along the Section 
map. This process can be repeated to generate a tower of relations, as illustrated in 
Fig.@ 

We can illustrate the data structures and transformations in Fig.j^by giving concrete 
examples for the parallel mesh in Fig. Given the partition in the figure, we have an 
SF FFpoint, called Shared Points in Fig.M 

FFpoint = {/ -> (e, 1), e (/3,1), ()) -> ( 7 ,1)}, 

FFpoint = {e —>■ (/, 0), /3 —>■ (e, 0), 7 —>■ (</), 0)}, 

where the superscript denotes the process on which the object lives. Let us define a 
data layout for the solution to a Stokes problem using the Taylor-Hood |Taylor and 
Hood 197^ fi nite element scheme (P 2 -P 1 ). We define the Section S^, called Data Lay¬ 
out in Fig. 1^ 

5" = {c : (2,0), d : (2, 2), / : (2,4), e : (3, 6 ), 5 : (3, 9), ^ : (3,12)} 

51 = {a : (2,0),6 : (2,2),e : (2,4),a : (3,6),/? : (3,9),7 : (3,12)}. 

Using PetscSFCreateSectionSFO, we obtain a Section SF^of, called Shared Dof in 
Fig. 1^ giving us the shared dofs between partitions, 

FFdV = {4 ^ (4,1), 5 ^ (5,1), 6 ^ (9,1), 7 ^ (10,1), 8 ^ (11,1), 

12 ^(12,1), 13 ^(13,1), 14 ^(14,1)} 
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which we note is only half of the relation, and SF stores one-sided information. The 
other half which is constructed on the fly is 

SFlf = {4 ^ (4,0), 5 ^ (5,0), 9 ^ (6,0), 10 ^ (7,0), 11 ^ (8,0), 

12 ^ (12,0), 13 ^ (13, 0), 14 ^ (14,0)}. 


7 <(> 



Fig. 3. A parallel simplicial doublet mesh, with points on process 0 blue and process 1 green. 

We can use these same relations to transform any parallel data layout into another 
given an SF which connects the source and target point layouts. Suppose that we 
have an SF which maps currently owned points to processes which will own them af¬ 
ter redistribution, which we will call a migration SF. With this SF, we can construct 
the section after redistribution and migrate the data itself This process is show in 
Alg. which uses PetscSFCreateSectionSFO from above to transform the migra¬ 
tion SF over points to one over dofs, and also PetscSFDistributeSectionO to create 
the section after redistribution. The section itself can be distributed using only one 
sparse broadcast, although we typically use another to setup remote dof offsets for 
PetscSFCreateSectionSFO, as shown in Alg. 


Algorithm 1 Algorithm for migrating data in parallel 

1 : function MIGRATEDATA(sf, secSource, dtype, dataSource, secTarget, dataTarget) 
2 : PETSCSFDlSTRlBUTESECTlON(sf, secSource, remoteOff, secTarget) 

3: PETSCSFCREATESECTlONSFCsf, secSource, remoteOff, secTarget, sfDof) 

4 : PETSCSFBCAST(sfDof, dtype, dataSource, dataTarget) 


These simple building blocks can now be used to migrate all the data for a DMPlex ob¬ 
ject, representing an unstructured mesh of arbitrary dimension composed of cells, each 
of which may have any shape. The migration of cone data, coordinates, and labels all 
follow the general migration algorithm above, since each piece of data can be expressed 
as the combination of a Section, giving the layout, and an array storing the values, in 
PETSc a Vec or IS object. Small differences from the generic algorithm arise due to 
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Algorithm 2 Algorithm for migrating a Section in parallel 

1 : function DlSTRlBUTESECTlON(sf, secSource, remoteOff, secTarget) 

2 : <Calculate domain (chart) from local SF points> 

3: PETSCSFBCAST(sf, secSource.dof, secTarget.dof) > Move point dof sizes 

4: PETSCSFBCAST(sf, secSource.ofif, remoteOff) > Move point dof offsets 

5: PETSCSECTIONSETUP(secTarget) 


the nature of the stored data. For example, the cone data must also be transformed 
from original local numbering to the new local numbering, which we accomplish by 
first moving to a global numbering and then to the new local numbering using two 
local-to-global renumberings. After moving the data, we can compute a new point SF 
using Alg. 1^ which uses a reduction to compute the unique owners of all points. 


Algorithm 3 Algorithm for migrating a mesh in parallel 
1 : function MiGRATEfdmSource, sf, dmTarget) 

2 : ISL0CALT0GL0BALMAPPINGAPPLYBL0CK(12g, csize, cones, cones) 

3: [> Convert to global numbering 

4: PETSCSFBCASTfsf, 12g, 12gMig) t> Redistribute renumbering 

5: DMPLEXDlSTRIBUTECONESCdmSource, sf, 12gMig, dmTarget) 

6 : DMPLEXDlSTRIBUTECOORDINATESCdmSource, sf, dmTarget) 

7 : DMPLEXDlSTRlBUTELABELSfdmSource, sf, dmTarget) 


Algorithm 4 Algorithm for migrating an SF in parallel 

1 : function MiGRATESFfsfSource, sfMig, sfTarget) 

2 : PETSCSFGETGRAPHfsfMig, Nr, Nl, leaves, NULL) 

3: for p 0, Nl do 0 Make bid to own all points we received 

4: lowners[p].rank = rank 

5: lowners[p].index = leaves ? leaves[p] : p 

6: for p ^ 0, Nr do > Flag so that MAXLOC does not use root value 

7: rowners[p].rank = -1 

8: rowners[p].index = -1 

9 : PETSCSFREDUCEfsfMigration, lowners, rowners, MAXLOC) 

10 : PETSCSFBCASTfsfMigration, rowners, lowners) 

11 : for p ^ 0, Nl,Ng = 0 do 

12: if lowners[p].rank != rank then 

13: ghostPoints[Ng] = leaves ? leaves[p] : p 

14: remotePoints[Ng].rank = lowners[p].rank 

15: remotePoints[Ng].index = lowners[p]. index 

16: Ng++ 

17: PETSCSFSETGRAPHfsfTarget, Np, Ng, ghostPoints, remotePoints) 


2.3. Mesh Distribution 

Using the data migration routines above, we can easily accomplish sophisticated mesh 
manipulation in PETSc. Thus, we can redistribute a given mesh in parallel, a special 
case of which is distribution of a serial mesh to a set of processes. As shown in Alg. 
we first create a partition using a third party mesh partitioner, and store it as a labm. 
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where the target ranks are label values. We take the closure of this partition in the 
DAG, invert the partition to get receiver data, allowing us to create a migration SF 
and use the data migration algorithms above. The only piece of data that we need in 
order to begin, or bootstrap, the partition process is an SF which connects sending and 
receiving processes. Below, we create the complete graph on processes, meaning that 
any process could communicate with any other, in order to avoid communication to dis¬ 
cover which processes receive from the partition. Discovery is possible and sometimes 
desirable, and will be incorporated in a further update. 


Algorithm 5 Algorithm for distributing a mesh in parallel 
1: function DlSTRIBUTE(dm, overlap, sf, pdm) 

2 : PETSCPARTITIONERPARTITIONlpart, dm, IblPart) [> Partition cells 

3: DMPLEXPARTITIONLABELCLOSURE(dm, IblPart) > Partition points 

4: for p 0, P do > Create process SF 

5: remoteProc[p].rank = p 

6: remoteProclp].index = rank 

7: PETSCSFSETGRAPH(sfProc, P, P, NULL, remoteProc) 

8: DMPLEXPARTlTlONLABELlNVERT(dm, IblPart, sfProc, IblMig) 

9: [> Convert from senders to receivers 

10: DMPLEXPARTITIONLABELCREATESFCdm, IblMig, sfMig) 

11: > Create migration SF 

12 : DMPLEXMlGRATE(dm, sfMigration, dmParallel) t> Distribute DM 

13: DMPLEXDlSTRlBUTESF(dm, sfMigration, dmParallel) i> Create new SF 


We can illustrate the migration process by showing how Fig. is derived from Fig. 
We begin with the doublet mesh contained entirely on one process. In the partition 
phase, we first create a cell partition consisting of a Section Fcpart for data layout and 
an IS cpart holding the points in each partition, 

Fcpart = {0 :(l,0),l:(l,l)}, 

cpart = {B, A}, 

which is converted to the equivalent Label, a data structure better optimized for over¬ 
lap insertion, 

-depart — {0 {B}, 1 {A}}, 

and then we create the transitive closure. We can express this as a Section Fpart, called 
Point Partition in Fig. and IS part with the partition data, 

Vrt = {0:(4,0),l:(7,4)}, 
part = {B, c, d, S, A, a, b, e, a, /3, 7 }, 

or as the equivalent Label 

Lpart = {0 {B, c, d, 5}, 1 {A, a, b, e, a, /3, 7 }}. 

The bootstrap SF SFp^oc, called Neighbors in Fig. encapsulates the data flow for 
migration 

FFproc = {0-A (0,0),!^ (1,1)}. 

We have a small problem in that the partition structure specifies the send information, 
and for an SF we require the receiver to specify the data to be received. Thus we need to 
invert the partition. This is accomplished with a single call to DMPlexDistributeDataO 
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from Alg. which is shown in Alg. This creates a Section and IS with the receive 
information, 

^Lpart = {0 : (4,0)} 

invpart = {B,c,d, 6} 

^Lpart = {0 : (7,0)1 

invpart = {A, a, b, e, a, (3, 7 }. 

and then we convert them back into a Label hinvpart- This simple implementation for 
the complex operation of partition inversion shows the power of our flexible interface 
for data movement. Since the functions operate on generic representations of data (e.g. 
Section, SF), the same code is reused for many different mesh types and mesh/data 
operations, and only a small codebase needs to be maintained. In fact, the distribu¬ 
tion (one-to-many) and redistribution (many-to-many) operations are identical except 
for an initial inversion of the point numbering to obtain globally unique numbers for 
cones. 


Algorithm 6 Algorithm for inverting a partition 
1 : MiGRATEDataCS’J^ptoc -Spart, MPIU_2INT, part, 5'invpart, invpart) 


After inverting our partition, we combine Tinvpart and -S'Pproc using 
DMPlexPartitionLabelCreateSFO, the equivalent of PetscSFCreateSectionSFO, 
to obtain the SF for point migration 

5Fpoint = M^(A,l),B^(B,0), 

a —^ (a, 1 ), b —( 6 , 1 ), c —^ (c, 0 ), d — y (d, 0 ),c —^ (c, 1 ), 

a —>■ {a, l),d —>■ (/3, 1),7 ( 7 ) 1))<^ (^) 1)}- 

In the final step, this SF is then used to migrate all the (Section, array) pairs in the 
DMPlex, such as cones, coordinates, and labels, using the generic DMPlexMigrate () func¬ 
tion. 

2.4. Overlap Generation 

Following the initial distribution of the mesh, which was solely based on the parti- 
tioner output, the set of overlapping local meshes can now be derived in parallel. This 
derivation is performed by each process computing it’s local contribution to the set of 
overlap points on neighboring processes, starting from an SF that contains the initial 
point sharing. It is important to note here that this approach performs the potentially 
costly adjacency search in parallel and that the search space is limited to the set of 
points initially shared along the partition boundary. 

The algorithm for identifying the set of local point contributions to neig hboring parti¬ 
tions is based on the respective adjacency definitions given in section [23| As illustrated 
in Alg. 1^ the SF containing the initial point overlap is first used to identify connections 
between local points and remote processes. To add a level of adjacent points, the local 
points adjacent to each connecting point are added to a partition label similar to the 
one used during the initial migration (see Alg. [^, identifying them as now also con¬ 
nected to the neighboring process. Once the point donations for the first level of cell 
overlap are defined, further levels can be added through repeatedly finding points ad¬ 
jacent to the current donations. 
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Algorithm 7 Algorithm for computing the partition overlap 
1 : function DMPLExCREATEOvERLAP(dm, overlap, sf, odm) 

2: DMPLEXDlSTRIBUTEOWNERSHIP(dm, sf, rootSection, rootRank) > Derive 

sender information from SF 

3: for leaf ^ sf .leaves do > Add local receive connections 

4: DMPLExGExADJACENCYlsf, leafindex, adjacency) 

5: for a adjacency do 

6: DMLABELSETVALUE(lbl01, a, leafrank) 

7: for p^OjPdo 0 Add local send connections 

8: if rootSection[p] > 0 then 

9 : DMPLExGETADJACENCY(sf, p, adjacency) 

10: for a ^ adjacency do 

11: DMLABELSETVALUEdblOl, a, rootRank[p]) 

12: for n ^ 1, overlap do [> Add further levels of adjacency 

13: DMPLEXPARTITIONLABELADJACENCYdblOl, n) 


Having established the mapping required to migrate remote overlap points, we can 
derive a migration SF similar to the one used in Alg. As shown in Alg. 1^ this allows 
us to utilize DMPlexMigrateO to generate the overlapping local sub-meshes, provided 
the migration SF also encapsulates the local point renumbering required to maintain 
stratification in the DMPlex DAG, meaning that cells are numbered contiguously, ver¬ 
tices are numbered contiguously, etc. This graph numbering shift can easily be derived 
from the SF that encapsulates the remote point contributions, thus enabling us to ex¬ 
press local and remote components of the overlap migration in a single SF. 


Algorithm 8 Algorithm for migrating overlap points 
1: function DMPLEXDlSTRIBUTEOVERLAP(dm, overlap, sf, odm) 

2 : DMPLExCREATEOVERLAP(dm, sf, IblOl) 0 Create overlap label 

3: DMPLEXPARTlTlONLABELCREATESF(dm, IblOl, sfDl) > Derive migration SF 
4: DMPLEXSTRATIFYMlGRATIONSF(dm, sfOl, sfMig) c> Shift point numbering 

5: DMPLEXMlGRATE(dm, sfMig, dmOl) c> Distribute overlap 

6: DMPLEXDlSTRlBUTESF(dm, sfMig, dmOl) > Create new SF 


3. RESULTS 


The performance of the distribution algorithms detailed in Alg. and has been 
evaluated on the UK National Supercomputer ARCHER, a Cray XES^O with 4920 nodes 
connected via an Aries interconnect^ Each node consists of two 2.7 GHz, 12-core Intel 
E5-2697 v2 (Ivy Bridge) processors with 64GB of memory. The benchmarks consist of 
distributing a three dimensional simplicial mesh of the unit cube across increasing 
numbers of MPI processes (strong scaling), while measuring execution time and the 
total amount of data communica ted per processor. The mesh is generated in memory 
using TetGen | Si 2015 Si 2005|| and t he partitioner used is Metis/ParMetis | Karypis 
and Kumar 1998HKarypis et al. 2005) . 


The performance of the partitioning and data migration components of the initial one- 
to-all mesh distribution, as well as the subsequent generation of the parallel overlap 


^http://www.archer.ac.uk/ 
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Fig. 4. Performance of initial one-to-all mesh distribution of a 3D unit cube mesh with approximately 12 
million cells. The distribution time is dominated by the time to send the serial mesh to all processes, and 
the overlap determination and communication time scales linearly with the number of processes. 


mapping is detailed in Fig. The presented run-time measurements indicate that 
the parallel overlap generation scales linearly with increasing numbers of processes, 
whereas the cost of the initial mesh distribution increases due to the the sequential 
partitioning cost. 


Data communication was measured as the accumulated message volu me (sent and 
receive d) per process for each stage using PETSc’s performance logging | |Balay et al. 
2014a| . As expected, the overall communication volume during distributed overlap 
generation increases with the number of processes due to data replication along the 
shared partition boundaries. The communicated data volume during the initial distri¬ 
bution, however, remains constant, indicating that the increasing run-time cost is due 
to sequential processing, not communication of the partitioning. In fact, the number of 
high-level communication calls, such as SF-broadcasts and SF-reductions is constant 
for meshes of all sizes and numbers of processes. A model of the total data volume 
communicated during the initial distribution of a three-dimensional mesh can be es¬ 
tablished as follows: 


Vsf = 4B*N 

^inversion — Fs/ “F 2 ^ 4:13 A 
Vstratify = Vgf + 4B * N 

^partition — ^inversion ^stratify 


(6) 
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^cones 

^orientations 

^section 

^topology 

^coordinates 

^markers 

^migration 


Nc * 4i? * 4 + Nf * 4B * 3 + * 45 * 2 
Nc * 45 * 4 + Nf * 45 * 3 + N^. * 45 * 2 
3 * j + 2 * 45 * N 

^cones 4 “ ^orientations 4 “ ^section 

(3 * 85 + 2 * 45) * iV„ 

3 * Vsf 

^topology 4“ ^coordinates 4“ ^markers 


(7) 


where iVc, Nf,Ne and denote the number of cells, faces, edges and vertices re¬ 
spectively, N = Nc + Nf + Ne + Ny and 14/ is the data volume required to ini¬ 
tialize an SR The unit square mesh used in the benchmarks has Ny = 12,582,912, 
Nf = 25,264,128, N^, = 14,827,904, Ny = 2,146,689, resulting in VpartiUon ~ I.IG'5 and 
^migration ~ 2.8G5. 

As well as initial mesh distribution the presented API also allows all-to-all mesh dis¬ 
tribution in order to improve load balance among partitions. Fig. depicts run-time 
and memory measurements for such a redistribution process, where an initial bad 
partitioning based on random assignment is improved through re-partitioning with 
ParMETIS. Similarly to the overlap distribution, the run-time cost demonstrate good 
scalability for the partitioning as well as the migration phase, while the communica¬ 
tion volume increases with the number of processes. 




Fig. 5. Performance of all-to-all mesh distribution of simplicial meshes in 2D and 3D. An initial random 
partitioning is re-partitioned via ParMetis and re-distributed to achieve load balancing. 


As demonstrated in Fig. the sequential overhead of generating the base mesh on a 
single process limits overall scalability of parallel run-time mesh generation. To over¬ 
come this bottleneck, parallel mesh refinement can be used to create high-resolution 
meshes in parallel from an initial coarse mesh. The performance benefits of this ap¬ 
proach are highlighted in Fig. where regular refinement is applied to a unit cube 
mesh with varying numbers oredges in each dimension. The performance measure¬ 
ments show clear improvements for the sequential components, initial mesh genera¬ 
tion and distribution, through the reduced mesh size, while the parallel refinement 
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operations and subsequent overlap generation scale linearly. Such an approach is par¬ 
ticularly useful for the generation of mesh hierarchies required for multigrid precon¬ 
ditioning. 


10 "^ 


"•s-a ■ • ■ 

■-as 




Overlap, refine: 0, siz 
■ Distribute, refine: 0, s 
Generate, refine: 0, si 
Refine, refine: 0, size: 


• — - Overlap, refine: 1, size: 64 
Distribute, refine: 1, size: 64 
' ■ ' Generate, refine: 1, size: 64 
. Refine, refine: 1, size: 64 


■ - Overlap, refine: 2, size 
^ Distribute, refine: 2, si 

■ ' Generate, refine: 2, si; 

Refine, refine: 2, size: 


12 24 

Number of processors 


48 


Fig. 6. Performance of parallel mesh generation via regular three dimensional refinement. 


4. CONCLUSIONS 


We have developed a concise, powerful API for general parallel mesh manipulation, 
based upon the DMPlexMigrateO capability. With just a few methods, we are able to 
express mesh distribution from serial to parallel, parallel redistribution of a mesh, and 
parallel determination and communication of arbitrary overlap. Moreover, a user could 
combine these facilities to specialize mesh distribution for certain parts of a calculation 
or for certain fields or solvers, since they are not constrained by a monolithic interface. 
Moreover, the same code applies to meshes of any dimension, with any cell shape and 
connectivity. Thus optimization of these few routines would apply to the universe of 
meshes expressible as CW-complexes. In combination with a set of widely used mesh 
file format readers this provides a powerful set of tools for efficient mesh management 
available to a wide range of applications through PETSc library interfaces (Lange et al. 

HoTD. 


In future work, we will apply these building blocks to the problem of fully parallel 
mesh construction and adaptivity. We will input a naive partition of the mesh calcula¬ 
ble from common serial mesh formats, and then rebalance the mesh in parallel. We are 
deve loping an interface to the P ragmatic unstructured parallel mesh refinement pack¬ 
age (Rokos and Gorman 2013) , which will allow parallel adaptive refinement where 
we currently use only regular refinement. 
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